[Dashboard widgets: a sampling table with columns Group (i), πθ_old(oi|q), πθ(oi|q), and Reward (ri); an advantage table with columns Group (i) and Advantage (Ai); a clipping table with columns Group (i), Ratio, Clipped Ratio, Unclipped Term, Clipped Term, and Min(Terms); and readouts for the KL divergence (how much the current policy differs from the reference), the KL penalty (β × KL divergence), the surrogate objective (average of the clipped terms), and the final objective (surrogate − KL penalty).]
The policy ratio πθ(oi|q) / πθ_old(oi|q) measures how much more (or less) likely the new policy is to select output oi compared to the old policy.
Clipping this ratio to the range [1-ε, 1+ε] prevents large policy updates, ensuring stability. This is a hallmark feature of Proximal Policy Optimization (PPO).
When the advantage Ai is positive, clipping limits how much the policy can improve for that action. When Ai is negative, clipping limits how much the policy can decrease the probability of that action.
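As a rough sketch of what the Ratio, Clipped Ratio, and Min(Terms) columns compute, here is the same step in Python; the function name, argument names, and the default ε = 0.2 are illustrative assumptions, not values taken from the paper.

```python
def clipped_terms(p_new, p_old, advantages, eps=0.2):
    """For each output o_i: min(ratio * A_i, clip(ratio, 1 - eps, 1 + eps) * A_i)."""
    terms = []
    for pn, po, A in zip(p_new, p_old, advantages):
        ratio = pn / po                                  # pi_theta(o_i|q) / pi_theta_old(o_i|q)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # restrict the ratio to [1 - eps, 1 + eps]
        terms.append(min(ratio * A, clipped * A))        # keep the more pessimistic term
    return terms
```

Taking the minimum of the two terms is what makes the update conservative: the policy never gets extra credit for moving further than the clip range allows.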
The advantage function Ai quantifies how good or bad selecting output oi is compared to the average.
It's calculated by normalizing the rewards to zero mean and unit variance: Ai = (ri - mean(rewards)) / std(rewards)
This normalization stabilizes training across varying reward scales.
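A toy Python version of that normalization could look like the sketch below; whether to use the population or the sample standard deviation, and how to treat a group where every reward is identical, are left open by the formula above, so the choices here are just one option.

```python
import statistics

def group_advantages(rewards):
    """A_i = (r_i - mean(rewards)) / std(rewards), computed within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # fall back to 1.0 if every reward is identical
    return [(r - mu) / sigma for r in rewards]
```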
The KL Divergence penalty encourages the policy πθ to stay close to a reference policy πref.
The formula is: DKL(πθ || πref) = Σ πθ(oi|q) log(πθ(oi|q) / πref(oi|q))
In this dashboard, we're using the old policy as the reference policy for simplicity.
The hyperparameter β controls the strength of this penalty. Higher values of β discourage large policy changes.
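A small sketch of that penalty term, summing only over the sampled outputs of the group (as the dashboard's tables do) rather than the full output distribution, with β passed in explicitly:

```python
import math

def kl_penalty(p_theta, p_ref, beta):
    """beta * sum_i p_theta(o_i|q) * log(p_theta(o_i|q) / p_ref(o_i|q))."""
    kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))
    return beta * kl
```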
The final GRPO objective combines the clipped surrogate objective with the KL divergence penalty:
JGRPO(θ) = Eq[ (1/G) Σi min(ratio × Ai, clipped_ratio × Ai) − β × DKL ]
The algorithm aims to maximize this objective function, which means raising the probability of outputs with above-average rewards while keeping the policy close to the reference.
This balance leads to stable and efficient policy improvement.
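To see the whole objective in one place, here is a self-contained toy calculation for a single question q; the probabilities, rewards, β = 0.04, and ε = 0.2 below are made-up illustrative values, not numbers from the paper or the dashboard.

```python
import math, statistics

def grpo_objective(p_new, p_old, rewards, beta=0.04, eps=0.2):
    """(1/G) * sum_i min(ratio * A_i, clip(ratio) * A_i) - beta * D_KL(new || old)."""
    mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    advantages = [(r - mu) / sigma for r in rewards]             # group-relative advantages
    surrogate = 0.0
    for pn, po, A in zip(p_new, p_old, advantages):
        ratio = pn / po
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate += min(ratio * A, clipped * A)
    surrogate /= len(rewards)                                    # average over the group
    kl = sum(p * math.log(p / q) for p, q in zip(p_new, p_old))  # old policy as the reference
    return surrogate - beta * kl

# Example group of G = 4 outputs: uniform old policy, new policy favoring the
# highest-reward output.
print(grpo_objective([0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25], [1.0, 0.5, 0.0, 0.0]))
```

In this example the new policy shifts probability toward the highest-reward output, so the averaged clipped term is positive and the small KL penalty only slightly reduces the objective.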
This interactive explorer was created with Claude 3.7 Sonnet. Here is the prompt, with the DeepSeek-R1 paper (summarized below) attached: "make an interactive dashboard in html/js/bootstrap to understand the GRPO with connected widgets in a dynamic way."
By Nicolas Martin - Fractal-Apps - 4/2025
Large Language Models (LLMs) are rapidly advancing, bringing us closer to the vision of Artificial General Intelligence (AGI). A critical aspect of this evolution is enhancing their reasoning capabilities. While approaches like increasing Chain-of-Thought (CoT) length during inference have shown promise, the challenge of effective test-time scaling remains. DeepSeek-AI introduces DeepSeek-R1 and DeepSeek-R1-Zero, their first-generation reasoning models, which explore the power of reinforcement learning (RL) to push the boundaries of LLM reasoning.
The journey began with DeepSeek-R1-Zero, a model trained through large-scale reinforcement learning without the initial step of supervised fine-tuning (SFT). This was a significant undertaking, aiming to investigate the potential for LLMs to develop reasoning abilities purely through self-evolution via RL. Using DeepSeek-V3-Base as the foundation and employing Group Relative Policy Optimization (GRPO) as the RL framework, DeepSeek-R1-Zero remarkably demonstrated the emergence of powerful and intriguing reasoning behaviors. This included capabilities like self-verification, reflection, and generating lengthy Chains of Thought, validating that RL alone can incentivize reasoning.
At the heart of DeepSeek-R1's training lies Group Relative Policy Optimization (GRPO). This RL algorithm is designed to be more computationally efficient by foregoing a critic model, typically the same size as the policy model. Instead, GRPO estimates a baseline from group scores.
Imagine you're trying to teach a model to solve a complex problem through trial and error. In traditional RL, a critic would evaluate each attempt individually. GRPO, however, takes a different approach. For a given problem, it samples a group of potential solutions generated by the model's current strategy. It then evaluates the quality of each solution within that group, and importantly, calculates an "advantage" for each solution based on how much better or worse it is compared to the average performance of the group. This group-relative comparison helps the model understand which variations in its approach led to better outcomes, allowing it to refine its strategy more efficiently.
To truly grasp GRPO, let's conceptualize an interactive explorer: a dashboard where you can adjust key variables (the per-output rewards, the clipping range ε, and the KL penalty weight β) and observe their impact on the learning process.
This interactive exploration highlights the nuanced interplay of factors within GRPO that contribute to the model's learning and the emergence of complex reasoning abilities.
While DeepSeek-R1-Zero showcased the power of pure RL, it faced challenges, notably poor readability and language mixing in its outputs. To address these and further boost performance, DeepSeek-R1 was introduced, incorporating a multi-stage training pipeline with a "cold start".
The cold start involved fine-tuning the DeepSeek-V3-Base model with a small dataset of high-quality, human-friendly Chain-of-Thought examples. This provided an initial boost in readability and guided the model towards a more desirable output format. The training then proceeded with reasoning-oriented RL, similar to DeepSeek-R1-Zero, but with the addition of a language consistency reward to mitigate mixing issues.
The pipeline didn't stop there. After the reasoning-oriented RL converged, a stage of rejection sampling and supervised fine-tuning was introduced. This involved generating new SFT data by sampling from the RL checkpoint and combining it with supervised data from other domains like writing and factual QA. This aimed to enhance the model's general capabilities alongside its reasoning prowess. Finally, a second RL stage was implemented, considering prompts from all scenarios to further align the model with human preferences for helpfulness and harmlessness.
This multi-stage approach of DeepSeek-R1, starting with a cold start and iteratively refining through RL and SFT, demonstrates an ingenious strategy to combine the exploration power of RL with the guidance of supervised data to achieve both high performance and user-friendly outputs.
DeepSeek-AI didn't just focus on training large, powerful models; they also explored how to make these advancements more accessible. A key innovation is the distillation of DeepSeek-R1's reasoning capabilities into smaller dense models. By fine-tuning models like Qwen and Llama using data curated with DeepSeek-R1, they demonstrated that the reasoning patterns learned by larger models can be effectively transferred to smaller ones.
This distillation process is significant because it allows smaller, more efficient models to achieve reasoning performance comparable to larger models. The open-sourcing of these distilled models contributes valuable resources to the research community, enabling broader access to advanced reasoning capabilities.
DeepSeek-R1 has demonstrated impressive performance across a range of benchmarks. It achieved results comparable to OpenAI-o1-1217 on reasoning tasks like AIME 2024 and MATH-500, and reached an expert-level rating on Codeforces for coding tasks. Beyond reasoning, DeepSeek-R1 also excels in knowledge-based tasks and general-purpose applications like creative writing and summarization. The distilled smaller models also set new records on reasoning benchmarks among dense models.
The work on DeepSeek-R1 and DeepSeek-R1-Zero highlights the immense potential of reinforcement learning in developing advanced reasoning abilities in LLMs. By open-sourcing their models and detailing their training methodologies, DeepSeek-AI provides valuable insights and resources for the research community to continue exploring the frontiers of AI reasoning. The ingenuity of their multi-stage training pipeline and the effectiveness of their distillation process pave the way for more capable and accessible language models in the future.